Abstract: This paper presents a novel hardware architecture for principal component 
analysis. The architecture is based on the Generalized Hebbian Algorithm (GHA) because 
of its simplicity and effectiveness. The architecture is separated into three portions: the 
weight vector updating unit, the principal computation unit and the memory unit. In the 
weight vector updating unit, the computation of different synaptic weight vectors shares the 
same circuit for reducing the area costs. To show the effectiveness of the circuit, a texture 
classification system based on the proposed architecture is physically implemented by Field 
Programmable Gate Array (FPGA). It is embedded in a System-On-Programmable-Chip 
(SOPC) platform for performance measurement. Experimental results show that the 
proposed architecture is an efficient design for attaining both high speed performance and 
low area costs. 

Keywords: system on programmable chip; reconfigurable computing; principal component 
analysis; generalized Hebbian algorithm; texture classification; FPGA 

1. Introduction 

Principal Component Analysis (PCA) [1] plays an important role in pattern recognition, classification, 
computer vision and data compression [2,3]. It is an effective feature extraction technique capable 
of finding a compact and accurate representation of the data that reduces or eliminates statistically 
redundant components. A basic PCA implementation involves the Eigenvalue Decomposition (EVD) of 
the covariance matrix. Long computation time and large storage size are usually required for the EVD. 
Basic PCA is therefore not suited for online computation on platforms with limited computation 
capacity and storage size. 

To compute the PCA with reduced computational complexity, a number of fast algorithms [2,4-6] 
have been proposed. The algorithm presented in [4] is based on Expectation Maximization (EM). 
The algorithm requires the computation of an inverse matrix, which may be expensive. 
Incremental and/or iterative algorithms for PCA computation are proposed in [2,5,6]. A common 
drawback of these fast PCA methods is that they still involve the covariance matrix of the training data. 
The computation time and storage may therefore still be expensive. Although hardware implementation of PCA 
is possible, large storage size and complicated circuit control management are usually necessary. 
PCA hardware implementations may therefore be used only for data with small dimensions [7-9] when 
limited hardware resources are available. Because of the difficulties of hardware implementation, many 
PCA-based applications use software for the PCA computation. After the eigenvectors are obtained, 
only the projection computation is implemented in hardware [10-12]. 

An alternative for the PCA implementation is to use the Generalized Hebbian Algorithm 
(GHA) [13,14]. The GHA is based on an effective incremental updating scheme that does not involve 
the covariance matrix. The storage requirement for the PCA implementation is then significantly reduced. 
Nevertheless, slow convergence of the GHA is usually observed. A large number of iterations is therefore 
required, resulting in long computation time. An effective approach to expediting GHA training 
is based on multithreading techniques, which take advantage of all the cores of multicore processors to 
reduce the computation time. However, multicore processors usually consume considerable power [15], and 
therefore may not be suited for applications requiring low power dissipation. 

Analog hardware implementations of GHA [16,17] have been found to be a power-efficient approach 
for accelerating the computation. However, these architectures are difficult to use directly in 
digital devices. A number of digital hardware architectures [18,19] have been proposed for expediting 
the GHA training process. The architecture in [18] separates the weight vector updating process of GHA 
into a number of stages for data reuse. Although the architecture has fast computation time, its hardware 
resource utilization grows linearly with the dimension of data and number of principal components. 
Therefore, the architecture may not be well suited for data with high vector dimension and/or large 
number of principal components. 

A systolic array with low area costs is proposed in [19]. The systolic array is based on pixel-wise 
operations so that the area costs for weight vector updating are independent of vector dimension. 
Nevertheless, the latency of the architecture increases with the dimension of data. Moreover, similar to 
the architecture in [18], the area costs of [19] grow with the number of principal components. Therefore, 
the architecture may still have long latency and high area costs. 

In light of the facts stated above, a novel GHA implementation capable of performing fast PCA with 
low power consumption is presented. The implementation is based on a Field Programmable Gate Array 
(FPGA) because it consumes less power than its multicore counterparts [20,21]. Compared with 
existing FPGA-based architectures for GHA, the proposed architecture has lower area cost and/or lower 
latency. The proposed architecture can be divided into three parts: the Synaptic Weight Updating (SWU) 
unit, the Principal Components Computing (PCC) unit, and the memory unit. The memory unit is the 
on-chip memory storing training vectors and synaptic weight vectors. Based on the data stored in the 
memory unit, the PCC and SWU units are then used to compute the principal components and update 
the synaptic weight vectors, respectively. 

In the SWU and PCC units, the input training vectors and synaptic weight vectors are separated into 
a number of non-overlapping blocks for principal component computation and synaptic weight vector 
updating. Both the SWU and PCC units operate on one block at a time. In each unit, the operations on 
different blocks share the same circuit to reduce the area costs. Moreover, in the SWU unit, the 
results for preceding weight vectors are used in the computation of subsequent weight vectors to 
reduce training time. 

To demonstrate the effectiveness of the proposed architecture, a texture classification system on a 
System-On-Programmable-Chip (SOPC) platform is constructed. The system consists of the proposed 
architecture, a softcore NIOS II processor [22], a DMA controller, and an SDRAM. The proposed 
architecture is adopted for finding the PCA transform by GHA training, where the training vectors 
are stored in the SDRAM. The DMA controller is used for the DMA delivery of the training vectors. 
The softcore processor is used only for coordinating the SOPC system. It does not participate in the GHA 
training process. Compared with its multithreaded software counterpart running on Intel multicore 
processors, our system has lower computation time and lower power consumption for large training 
sets. All these facts demonstrate the effectiveness of the proposed architecture. 

2. Preliminaries 

Figure 1 shows the neural model for GHA, where $\mathbf{x}(n) = [x_1(n), \ldots, x_m(n)]^T$ and 
$\mathbf{y}(n) = [y_1(n), \ldots, y_p(n)]^T$ are the input and output vectors of the GHA model, respectively. 
In addition, $m$ and $p$ are the vector dimension and the number of Principal Components (PCs) for the GHA, 
respectively. The output vector $\mathbf{y}(n)$ is related to the input vector $\mathbf{x}(n)$ by 

\[ y_j(n) = \sum_{i=1}^{m} w_{ji}(n)\, x_i(n) \tag{1} \]

where $w_{ji}(n)$ stands for the weight from the $i$-th synapse to the $j$-th neuron at iteration $n$. 
Let 

\[ \mathbf{w}_j(n) = [w_{j1}(n), \ldots, w_{jm}(n)]^T, \quad j = 1, \ldots, p \tag{2} \]

be the $j$-th synaptic weight vector. Each synaptic weight vector $\mathbf{w}_j(n)$ is adapted by the Hebbian 
learning rule: 

\[ w_{ji}(n+1) = w_{ji}(n) + \eta \left[ y_j(n)\, x_i(n) - y_j(n) \sum_{k=1}^{j} w_{ki}(n)\, y_k(n) \right] \tag{3} \]

where $\eta$ denotes the learning rate. After a large number of iterative computations and adaptations, 
$\mathbf{w}_j(n)$ will asymptotically approach the eigenvector associated with the $j$-th eigenvalue $\lambda_j$ of the 
covariance matrix of the input vectors, where $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. To reduce the complexity of the 
implementation, Equation (3) can be rewritten as 

\[ w_{ji}(n+1) = w_{ji}(n) + \eta\, y_j(n) \left[ x_i(n) - \sum_{k=1}^{j} w_{ki}(n)\, y_k(n) \right] \tag{4} \]

A more detailed discussion of GHA can be found in [13,14]. 

Figure 1. The neural model for the GHA. 


3. The Proposed GHA Architecture 

As shown in Figure 2, the proposed GHA architecture consists of three functional units: the memory 
unit, the Synaptic Weight Updating (SWU) unit, and the Principal Components Computing (PCC) unit. 
The memory unit is used for storing the current synaptic weight vectors and input vectors. Assume the 
current synaptic weight vectors $\mathbf{w}_j(n)$, $j = 1, \ldots, p$, are now stored in the memory unit. In addition, 
the input vector $\mathbf{x}(n)$ is available. Based on $\mathbf{x}(n)$ and $\mathbf{w}_j(n)$, $j = 1, \ldots, p$, the goal of the PCC unit is to 
compute the output vector $\mathbf{y}(n)$. Using $\mathbf{x}(n)$, $\mathbf{y}(n)$ and $\mathbf{w}_j(n)$, $j = 1, \ldots, p$, the SWU unit produces the new 
synaptic weight vectors $\mathbf{w}_j(n+1)$, $j = 1, \ldots, p$. It can be observed from Figure 2 that the new synaptic 
weight vectors will be stored back to the memory unit for subsequent training. 

Figure 2. The proposed GHA architecture. 

3.1. SWU Unit 

The design of the SWU unit is based on Equation (4). Although a direct implementation of Equation (4) 
is possible, it would consume large hardware resources. To elaborate on this, we first see from 
Equation (4) that the computations of $w_{ji}(n+1)$ and $w_{ri}(n+1)$ share the same term $\sum_{k=1}^{r} w_{ki}(n)\, y_k(n)$ 
when $r < j$. Consequently, implementing $w_{ji}(n+1)$ and $w_{ri}(n+1)$ independently in hardware using 
Equation (4) would result in large hardware resource overhead. 

To reduce the resource consumption, we first define $z_{ji}(n)$ as 

\[ z_{ji}(n) = x_i(n) - \sum_{k=1}^{j} w_{ki}(n)\, y_k(n), \quad j = 1, \ldots, p \tag{5} \]

and $\mathbf{z}_j(n) = [z_{j1}(n), \ldots, z_{jm}(n)]^T$. Combining Equations (4) and (5), we obtain 

\[ w_{ji}(n+1) = w_{ji}(n) + \eta\, y_j(n)\, z_{ji}(n) \tag{6} \]

where $z_{ji}(n)$ can be obtained from $z_{(j-1)i}(n)$ by 

\[ z_{ji}(n) = z_{(j-1)i}(n) - w_{ji}(n)\, y_j(n), \quad j = 2, \ldots, p \tag{7} \]

When $j = 1$, from Equations (5) and (7), it follows that 

\[ z_{0i}(n) = x_i(n) \tag{8} \]

Figure 3 depicts the hardware implementation of Equations (6) and (7). As shown in the figure, the 
SWU unit produces one synaptic weight vector at a time. The computation of $\mathbf{w}_j(n+1)$, the $j$-th weight 
vector at iteration $n+1$, requires $\mathbf{z}_{j-1}(n)$, $\mathbf{y}(n)$ and $\mathbf{w}_j(n)$ as inputs. In addition to $\mathbf{w}_j(n+1)$, the 
SWU unit also produces $\mathbf{z}_j(n)$, which will then be used for the computation of $\mathbf{w}_{j+1}(n+1)$. Hardware 
resource consumption can then be effectively reduced. 

Figure 3. The hardware implementation of Equations (6) and (7). 
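
The recursion in Equations (6) to (8) can be sketched in software as follows. This is a behavioral model only, under the assumption that the outputs y_j(n) have already been computed. The names `swu_update` and `swu_all` are hypothetical and do not come from the original design.

```python
from typing import List, Tuple

Vector = List[float]


def swu_update(z_prev: Vector, w_j: Vector, y_j: float,
               eta: float) -> Tuple[Vector, Vector]:
    """Return (w_j(n+1), z_j(n)) from z_{j-1}(n), w_j(n) and y_j(n)."""
    # Equation (7): z_ji(n) = z_(j-1)i(n) - w_ji(n) * y_j(n)
    z_j = [z - w * y_j for z, w in zip(z_prev, w_j)]
    # Equation (6): w_ji(n+1) = w_ji(n) + eta * y_j(n) * z_ji(n)
    w_next = [w + eta * y_j * z for w, z in zip(w_j, z_j)]
    return w_next, z_j


def swu_all(x: Vector, w: List[Vector], y: Vector, eta: float) -> List[Vector]:
    """Update all p weight vectors in order, chaining z_j(n) between them."""
    z = list(x)  # Equation (8): z_0(n) = x(n)
    w_next = []
    for w_j, y_j in zip(w, y):
        w_j_next, z = swu_update(z, w_j, y_j, eta)
        w_next.append(w_j_next)
    return w_next
```

Each call to `swu_update` needs one subtraction and two multiply-accumulate style operations per element, regardless of j, because the running vector z carries the partial sum forward.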

One way to implement the SWU unit is to produce $\mathbf{w}_j(n+1)$ and $\mathbf{z}_j(n)$ in one shot. However, $m$ 
identical modules, individually shown in Figure 4, may be required because the dimension of the vectors is 
$m$. The area costs of the SWU unit then grow linearly with $m$. To further reduce the area costs, each of 
the output vectors $\mathbf{w}_j(n+1)$ and $\mathbf{z}_j(n)$ is separated into $b$ blocks, where each block contains $q$ elements. 
The SWU unit computes only one block of $\mathbf{w}_j(n+1)$ and $\mathbf{z}_j(n)$ at a time. Therefore, it takes $b$ clock 
cycles to produce the complete $\mathbf{w}_j(n+1)$ and $\mathbf{z}_j(n)$. 

Let 

\[ \mathbf{w}_{j,k}(n) = [w_{j,(k-1)q+1}(n), \ldots, w_{j,(k-1)q+q}(n)]^T, \quad k = 1, \ldots, b \tag{9} \]

and 

\[ \mathbf{z}_{j,k}(n) = [z_{j,(k-1)q+1}(n), \ldots, z_{j,(k-1)q+q}(n)]^T, \quad k = 1, \ldots, b \tag{10} \]

be the $k$-th block of $\mathbf{w}_j(n)$ and $\mathbf{z}_j(n)$, respectively. The computation of $\mathbf{w}_j(n+1)$ and $\mathbf{z}_j(n)$ takes $b$ clock 
cycles. At the $k$-th clock cycle, $k = 1, \ldots, b$, the SWU unit computes $\mathbf{w}_{j,k}(n+1)$ and $\mathbf{z}_{j,k}(n)$. Because 
each of $\mathbf{w}_{j,k}(n+1)$ and $\mathbf{z}_{j,k}(n)$ contains only $q$ elements, the SWU unit consists of $q$ identical modules. 
The architecture of each module is also shown in Figure 4. The SWU unit can be used for GHA with 
different vector dimensions $m$. As $m$ increases, the area costs remain the same at the expense 
of a larger number of clock cycles $b$ for the computation of all blocks $\mathbf{w}_{j,k}(n+1)$ and $\mathbf{z}_{j,k}(n)$. 

Figure 4. The architecture of each module in the SWU unit. 
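
The block-wise schedule described above can be modeled as a loop over b blocks of q elements, where each loop pass stands for one clock cycle of the q parallel modules. The sketch below is a software analogy under the assumption that m is an exact multiple of q. The function name `swu_blockwise` is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def swu_blockwise(z_prev: Vector, w_j: Vector, y_j: float, eta: float,
                  q: int) -> Tuple[Vector, Vector, int]:
    """Compute w_j(n+1) and z_j(n) one q-element block per step.

    Returns the two full vectors and the number of steps taken (b).
    """
    m = len(w_j)
    if m % q != 0:
        raise ValueError("this sketch assumes m is a multiple of q")
    b = m // q
    w_next: Vector = []
    z_j: Vector = []
    for k in range(b):                      # one step per block k = 1..b
        lo, hi = k * q, (k + 1) * q         # elements (k-1)q+1 .. (k-1)q+q
        for z, w in zip(z_prev[lo:hi], w_j[lo:hi]):   # q parallel modules
            z_new = z - w * y_j                       # Equation (7)
            z_j.append(z_new)
            w_next.append(w + eta * y_j * z_new)      # Equation (6)
    return w_next, z_j, b
```

For a fixed q, doubling m doubles the returned step count b while the work per step stays the same, which mirrors the area versus cycle-count trade-off stated in the text.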




Based on Equation (8), the input vector $\mathbf{z}_0(n)$ is actually the training vector $\mathbf{x}(n)$, which is also 
separated into $b$ blocks, where the $k$-th block is given by 

\[ \mathbf{z}_{0,k}(n) = [x_{(k-1)q+1}(n), \ldots, x_{(k-1)q+q}(n)]^T, \quad k = 1, \ldots, b \tag{11} \]

The blocks $\mathbf{z}_{0,k}(n)$ and $\mathbf{w}_{1,k}(n)$, $k = 1, \ldots, b$, are used as the input vectors for the computation of $\mathbf{z}_{1,k}(n)$ 
and $\mathbf{w}_{1,k}(n+1)$, $k = 1, \ldots, b$. The vectors $\mathbf{z}_1(n)$ and $\mathbf{w}_1(n+1)$ become available when all the $\mathbf{z}_{1,k}(n)$ and 
$\mathbf{w}_{1,k}(n+1)$, $k = 1, \ldots, b$, are obtained. Figure 5 shows the computation of $\mathbf{z}_{1,1}(n)$ and $\mathbf{w}_{1,1}(n+1)$ 
based on $\mathbf{z}_{0,1}(n)$ and $\mathbf{w}_{1,1}(n)$. 

Figure 5. The SWU unit operation for computing the first segment of $\mathbf{w}_1(n+1)$. 

After the computation of $\mathbf{w}_1(n+1)$ and $\mathbf{z}_1(n)$ is completed, the vector $\mathbf{z}_1(n)$ is then used for the 
computation of $\mathbf{z}_2(n)$ and $\mathbf{w}_2(n+1)$. The vector $\mathbf{z}_2(n)$ is then used for the computation of $\mathbf{w}_3(n+1)$. 
The weight vector updating process at iteration $n+1$ is not completed until the SWU unit 
produces the weight vector $\mathbf{w}_p(n+1)$. 

3.2. PCC Unit 

The PCC operations are based on Equation (1). Therefore, the PCC unit of the proposed architecture 
contains adders and multipliers. Because the number of multipliers grows with the vector dimension $m$, 
a direct implementation of Equation (1) may consume large hardware resources when $m$ becomes 
large. As in the SWU unit, block-based computation is used to reduce the area costs. Based 
on Equations (9) and (11), Equation (1) can be rewritten as 

\[ y_j(n) = \sum_{k=1}^{b} \sum_{i=1}^{q} w_{j,(k-1)q+i}(n)\, x_{(k-1)q+i}(n) = \sum_{k=1}^{b} \mathbf{w}_{j,k}^T(n)\, \mathbf{z}_{0,k}(n) \tag{12} \]

The implementation of Equation (12) needs only $q$ multipliers, a $q$-input adder, an accumulator, and 
a $p$-entry buffer, as shown in Figure 6. The multipliers and the $q$-input adder are organized as an $s$-stage 
pipeline to enhance the throughput of the circuit. 

Figure 6. The PCC unit architecture. 

The blocks $\mathbf{w}_{j,k}(n)$ and $\mathbf{z}_{0,k}(n)$ are the inputs to the PCC unit. Figure 6 also shows the operation of the 
PCC unit when the input vectors are $\mathbf{w}_{j,1}(n)$ and $\mathbf{z}_{0,1}(n)$. Note that the output of the accumulator in the 
circuit becomes $y_j(n)$ only after all the blocks $\mathbf{w}_{j,k}(n)$ and $\mathbf{z}_{0,k}(n)$, $k = 1, \ldots, b$, have been fetched from 
the memory unit. The computation of each $y_j(n)$ therefore takes $b + s$ cycles. After the computation of 
$y_j(n)$ is completed, $y_j(n)$ is stored in the $j$-th entry of the buffer for the subsequent computation of 
$\mathbf{w}_j(n+1)$ in the SWU unit. 
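
The blocked inner product of Equation (12) can be illustrated with a short software model: each pass of the loop corresponds to one block entering the q multipliers and the q-input adder, and the running total plays the role of the accumulator. The sketch assumes m is a multiple of q and ignores pipelining. The names `pcc_blockwise` and `pcc_cycles` are hypothetical.

```python
from typing import List

Vector = List[float]


def pcc_blockwise(w_j: Vector, z0: Vector, q: int) -> float:
    """Accumulate y_j(n) from b partial inner products of q elements each."""
    m = len(w_j)
    if m % q != 0:
        raise ValueError("this sketch assumes m is a multiple of q")
    acc = 0.0                                   # the accumulator
    for k in range(m // q):                     # one block per step, k = 1..b
        lo, hi = k * q, (k + 1) * q
        # q multipliers followed by a q-input adder
        acc += sum(w * x for w, x in zip(w_j[lo:hi], z0[lo:hi]))
    return acc


def pcc_cycles(b: int, s: int) -> int:
    """Cycle count per y_j(n) as stated in the text: b + s."""
    return b + s
```

For example, with m = 16 x 16 = 256 and q = 64 (the value used in the experiments), b = 4, so `pcc_cycles(4, s)` gives the per-output cycle count for a given pipeline depth s.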

3.3. Memory Unit 

The memory unit contains three buffers: Buffer A, Buffer B and Buffer C. Buffer A fetches and stores 
the training vector $\mathbf{x}(n)$ from the main memory. Buffer B contains $\mathbf{z}_j(n)$ for the computation in the PCC and 
SWU units. The synaptic weight vectors $\mathbf{w}_j(n)$ are stored in Buffer C. All the buffers are shift registers. 

To fetch the training vector $\mathbf{x}(n)$ from main memory, the $m$ elements in the training vector are interleaved 
and separated into $q$ segments. Each segment contains $b$ elements. Therefore, Buffer A is a $q$-stage shift 
register, where each stage contains $b$ cells, as shown in Figure 7. Once all the $q$ segments are received, 
they are copied to Buffer B as $\mathbf{z}_0(n)$. 

The architecture of Buffer B is depicted in Figure 8. It holds the values of $\mathbf{z}_j(n)$ for the computation in the 
PCC and SWU units. The data in Buffer B is initialized by Buffer A. That is, the initial content of Buffer 
B is $\mathbf{x}(n)$ (i.e., $\mathbf{z}_0(n)$). As shown in Figure 9, Buffer B then provides the $b$ blocks $\mathbf{z}_{0,k}(n)$, $k = 1, \ldots, b$, 
sequentially to the PCC unit for the computation of $y_j(n)$. Because $\mathbf{z}_0(n)$ is used for the operations in both the PCC 
and SWU units, all the data output to the PCC unit is also rotated back to Buffer B. 

Figure 7. The Buffer A architecture in memory unit. 

Figure 8. The Buffer B architecture in memory unit. 

Figure 9. The Buffer B operation for the PCC unit. 

After the PCC computation is completed, Buffer B then delivers data to the SWU unit. Starting 
from $\mathbf{z}_0(n)$, Buffer B provides $\mathbf{z}_j(n)$ to the SWU unit, and then receives $\mathbf{z}_{j+1}(n)$ from the SWU unit for 
$j = 0, \ldots, p - 1$. The delivery of $\mathbf{z}_j(n)$ and collection of $\mathbf{z}_{j+1}(n)$ are on a block-by-block basis, as 
depicted in Figure 10. 



Figure 10. The Buffer B operation for the SWU unit. 
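
The two modes of Buffer B can be mimicked in software with a double-ended queue of blocks: in the PCC mode every block shifted out is rotated back unchanged, while in the SWU mode the block shifted out is replaced by the updated block returned by the SWU unit. This is only an analogy for the shift-register behavior described above. The function names `feed_pcc` and `feed_swu` are hypothetical.

```python
from collections import deque
from typing import Deque, List

Block = List[float]


def feed_pcc(buffer_b: Deque[Block]) -> List[Block]:
    """Deliver every block once and rotate it back, leaving the buffer unchanged."""
    delivered = []
    for _ in range(len(buffer_b)):
        block = buffer_b.popleft()   # block shifted out toward the PCC unit
        delivered.append(block)
        buffer_b.append(block)       # same block rotated back into the buffer
    return delivered


def feed_swu(buffer_b: Deque[Block], w_blocks: List[Block], y_j: float) -> None:
    """Shift out z_{j-1,k}(n) and shift in z_{j,k}(n), block by block."""
    for w_blk in w_blocks:
        z_blk = buffer_b.popleft()
        # Equation (7) applied to one block
        buffer_b.append([z - w * y_j for z, w in zip(z_blk, w_blk)])
```

After `feed_pcc` the buffer still holds z_0(n) in its original order, so the same data is available again for the SWU pass, which is the purpose of the rotation described in the text.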

Buffer C contains the synaptic weight vectors $\mathbf{w}_j(n)$, $j = 1, \ldots, p$. In addition to providing 
and storing data for the computation in the PCC and SWU units, it also holds the final results after GHA 
training. Figure 11 shows the architecture of Buffer C. Similar to Buffer B, each synaptic weight vector 
$\mathbf{w}_j(n)$ is divided into $b$ blocks. They are delivered to the PCC unit sequentially for the computation of $y_j(n)$. 
Moreover, since $\mathbf{w}_j(n)$ is also needed for the computation of $\mathbf{w}_j(n+1)$ in the SWU unit, the $b$ blocks 
delivered to the PCC unit are also rotated back to Buffer C. Figure 12 shows the operation of 
Buffer C for computation in the PCC unit. 

Figure 11. The Buffer C architecture. 

Figure 12. The Buffer C operation for the PCC unit. 

To support the computation in the SWU unit, Buffer C delivers $\mathbf{w}_j(n)$ to the SWU unit, and then receives 
$\mathbf{w}_j(n+1)$ from the unit. The delivery of $\mathbf{w}_j(n)$ and collection of $\mathbf{w}_j(n+1)$ are also on a block-by-block 
basis, as depicted in Figure 13. 

Figure 13. The Buffer C operation for the SWU unit. 

Based on the operations of the memory unit, Figure 14 shows the timing diagram of the proposed 
architecture. It can be observed from the figure that Buffer A operates concurrently with Buffers 
B and C. That is, while the proposed architecture is fetching the training vector $\mathbf{x}(n+1)$ into Buffer A, it 
is also computing $y_j(n)$ and $\mathbf{w}_j(n+1)$ based on $\mathbf{x}(n)$ and $\mathbf{w}_j(n)$. Fetching training vectors may be a 
time-consuming process as the vector dimension grows. Therefore, parallel operation of training vector fetching 
and weight vector computation is beneficial for increasing the GHA training speed. 

3.4. SOPC-Based GHA Training System 

The proposed architecture is used as custom user logic in an SOPC system consisting of a softcore 
NIOS CPU [22], a DMA controller and an SDRAM, as depicted in Figure 15. All training vectors are 
stored in the SDRAM and then transported to the proposed circuit via the Avalon bus. DMA-based 
training data delivery is performed so that the memory access overhead can be minimized. The softcore 
NIOS CPU runs simple software to support the proposed circuit for GHA training. The software 
is used only for coordinating the different components in the SOPC platform. It is not involved in the GHA 
computations. When the delivery of the training vectors is completed, the softcore CPU retrieves the 
training results from the proposed architecture for subsequent classification operations. 

Figure 14. The timing diagram for the operations of the proposed architecture: (a) $q > 2bp + s$; (b) $q < 2bp + s$. 

Figure 15. The SOPC system for implementing GHA. 

Figure 16 depicts the interface of the proposed architecture to the SOPC system. The interface consists 
of an interface buffer for transferring data between the proposed GHA architecture and the SOPC system. 
The proposed GHA architecture contains a simple controller for accessing the interface. Figure 17 
depicts the operations of the controller. As shown in Figure 17, the proposed circuit fetches the training 
vectors from the interface buffer to Buffer A for subsequent processing. In addition, after the completion 
of training, the synaptic weight vectors in Buffer C are delivered to the interface buffer so that they can 
be accessed by the NIOS CPU. 

Figure 16. The interface of the proposed architecture to the SOPC system. 

Figure 17. The operation of the controller of the proposed architecture. 

4. Performance Analysis and Experimental Results 

Area complexity and latency are the major performance measures considered in this study. Because 
adders, multipliers and registers are the basic building blocks of the GHA architecture, the area 
complexity is separated into three categories: the number of adders, the number of multipliers and the 
number of registers. Given the current synaptic weight vectors $\mathbf{w}_j(n)$, $j = 1, \ldots, p$, the latency of the 
proposed GHA architecture is defined as the time required to produce the new synaptic weight vectors 
$\mathbf{w}_j(n+1)$, $j = 1, \ldots, p$. 

Table 1 shows the area complexities and latency of various architectures for GHA training. It can 
be observed from the table that the number of adders and multipliers of the proposed architecture are 
independent of the vector dimension m and the number of principal components p. By contrast, the area 
costs of [18] grow with both m and p. We can also see from the table that the latency of [19] increases 
with both m and p. Based on the timing diagram shown in Figure 14, the latency of the proposed 
architecture is $\max(q, 2bp + s)$. Therefore, it is independent of vector dimension m. The proposed 
architecture is then well suited for applications requiring large vector dimension m. 



Table 1. Performance analysis of various architectures for GHA training. 

Next we consider the physical implementation of the proposed architecture. The design platform 
is Altera Quartus II with SOPC Builder [23] and NIOS II IDE. Table 2 shows the hardware resource 
consumption of the proposed architecture for vector dimensions m = 16 × 16 and m = 32 × 32, 
respectively. The hardware resource utilization of the entire SOPC system is given in Table 3. In 
order to maintain low area cost, we use a fixed-point format to represent data. The format 
is signed 8 bits. The target FPGA device is Altera Cyclone IV EP4CGX150DF31C7. The number of 
modules q is 64 for all the implementations shown in the tables. 

Three different area resources are considered in the tables: Logic Elements (LEs), embedded memory 
bits, and embedded multipliers. The LEs are used for the implementation of adders, multipliers and 
registers in the proposed GHA architecture. Both the LEs and embedded memory bits are also used 
for the implementation of NIOS CPU of the SOPC system. The embedded multipliers are used for the 
implementation of the multipliers of the proposed GHA architecture. 

It can be observed from Tables 2 and 3 that the consumption of embedded multipliers of the proposed 
architecture is independent of the vector dimension m and number of principal components p. Because 
the embedded multipliers are used only for the implementation of the multipliers in the proposed architecture, 
their number depends only on q. In the experiment, all the implementations in Tables 2 and 3 have the same 
q. Therefore, all the implementations utilize the same number of embedded multipliers. 



Sensors 2012, 12 






Table 2. Hardware resource consumption of the proposed GHA architecture for vector 
dimensions m = 16 × 16 and m = 32 × 32. Entries are given as used/available. 

| p | LEs (m = 16 × 16) | Memory Bits (m = 16 × 16) | Embedded Multipliers (m = 16 × 16) | LEs (m = 32 × 32) | Memory Bits (m = 32 × 32) | Embedded Multipliers (m = 32 × 32) |
|---|---|---|---|---|---|---|
| 3 | 35,386/149,760 | 0/6,635,520 | 704/720 | 85,271/149,760 | 7,168/6,635,520 | 704/720 |
| 4 | 37,731/149,760 | 0/6,635,520 | 704/720 | 94,244/149,760 | 7,168/6,635,520 | 704/720 |
| 5 | 40,043/149,760 | 7,168/6,635,520 | 704/720 | 103,394/149,760 | 7,168/6,635,520 | 704/720 |
| 6 | 42,404/149,760 | 7,168/6,635,520 | 704/720 | 112,679/149,760 | 7,168/6,635,520 | 704/720 |
| 7 | 44,737/149,760 | 7,168/6,635,520 | 704/720 | 121,940/149,760 | 7,168/6,635,520 | 704/720 |



Table 3. Hardware resource consumption of the SOPC system using the proposed GHA 
architecture as hardware accelerator for vector dimensions m = 16 × 16 and m = 32 × 32. 
Entries are given as used/available. 

| p | LEs (m = 16 × 16) | Memory Bits (m = 16 × 16) | Embedded Multipliers (m = 16 × 16) | LEs (m = 32 × 32) | Memory Bits (m = 32 × 32) | Embedded Multipliers (m = 32 × 32) |
|---|---|---|---|---|---|---|
| 3 | 44,377/149,760 | 446,824/6,635,520 | 708/720 | 94,736/149,760 | 453,992/6,635,520 | 708/720 |
| 4 | 46,786/149,760 | 446,824/6,635,520 | 708/720 | 103,968/149,760 | 453,992/6,635,520 | 708/720 |
| 5 | 49,096/149,760 | 453,992/6,635,520 | 708/720 | 113,207/149,760 | 453,992/6,635,520 | 708/720 |
| 6 | 51,449/149,760 | 453,992/6,635,520 | 708/720 | 122,537/149,760 | 453,992/6,635,520 | 708/720 |
| 7 | 54,055/149,760 | 453,992/6,635,520 | 708/720 | 131,779/149,760 | 453,992/6,635,520 | 708/720 |



Because the embedded memory bits are mainly used for the realization of the NIOS CPU, the 
consumption of embedded memory bits is also largely independent of m and p, as shown in Tables 2 and 3. It 
can be observed from the tables that the consumption of LEs grows with m and p. This is not surprising 
because the LEs are used to implement the registers. Moreover, the number of registers increases with m 
and p, as shown in Table 1. Therefore, the numerical results shown in Tables 2 and 3 are consistent with 
the analytical results in Table 1. 

Figures 18 and 19 show the Classification Success Rate (CSR) distributions of the proposed 
architecture for the textures shown in Figures 20 and 21, respectively. The CSR is defined as the number 
of test vectors that are successfully classified divided by the total number of test vectors. The number 
of principal components is p = 4. The vector dimensions are m = 16 × 16 and 32 × 32. The distribution 
for each vector dimension is based on 20 independent GHA training processes. The CSR distribution of 
the architecture presented in [18] with the same p is also included for comparison. The vector 
dimension for [18] is m = 4 × 4. 

Figure 18. The CSR distributions of the proposed architecture for the texture set shown in Figure 20. 

Figure 19. The CSR distributions of the proposed architecture for the texture set shown in Figure 21. 

Figure 20. The set of textures for CSR measurements in Figure 18. 

Figure 21. The set of textures for CSR measurements in Figure 19. 




The size of each texture in Figures 20 and 21 is 576 × 576. In the experiment, the Principal Component 
based k Nearest Neighbor (PC-kNN) rule is adopted for texture classification. Two steps are involved in 
the PC-kNN rule. In the first step, the GHA is applied to the input vectors to transform m-dimensional 
data into p principal components. The synaptic weight vectors after the convergence of GHA training 
are adopted to span the linear transformation matrix. In the second step, the kNN method is applied to 
the principal subspace for texture classification. 
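
The two steps of the PC-kNN rule, together with the CSR definition given earlier, can be sketched as follows. The text does not state the distance measure or the voting rule, so squared Euclidean distance and simple majority voting are assumptions here, as are the function names `project`, `pc_knn` and `csr`.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def project(x: Vector, weights: Sequence[Vector]) -> List[float]:
    """Map an m-dimensional vector to p principal components (Equation (1))."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]


def pc_knn(x: Vector, weights: Sequence[Vector],
           train: Sequence[Tuple[Vector, str]], k: int) -> str:
    """Classify x by majority vote among its k nearest training projections."""
    y = project(x, weights)
    scored = []
    for vec, label in train:
        t = project(vec, weights)
        dist2 = sum((a - b) ** 2 for a, b in zip(y, t))  # squared Euclidean (assumed)
        scored.append((dist2, label))
    scored.sort(key=lambda item: item[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def csr(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Classification success rate: correct test vectors / all test vectors."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

In this sketch the rows of `weights` stand for the trained synaptic weight vectors, so the distance computation runs on p values per vector instead of m.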

It can be observed from Figures 18 and 19 that the proposed architecture has better CSR than the architecture 
in [18]. This is because the vector dimensions of the proposed architecture are higher than those in [18]. Spatial 
information of textures can therefore be effectively exploited. The proposed architecture is able to 
implement hardware GHA training with vector dimensions up to m = 32 × 32. The hardware 
realization for m = 32 × 32 is possible because the area costs of the SWU and PCC units in the proposed 
architecture are independent of vector dimension. By contrast, the area costs of the SWU and PCC units 
in [18] grow with the vector dimension. Therefore, only a smaller vector dimension (i.e., m = 4 × 4) can 
be implemented. 

Although the proposed architecture is based on a signed 8-bit fixed-point format, the degradation in 
CSR is small compared with the GHA without truncation. Figure 22 reveals the truncation effects 
of the proposed architecture. The GHA without truncation is implemented in software 
with a floating-point format. The training images for this experiment are shown in Figure 20. The vector 
dimension is 32 × 32. The distribution for each format is based on 20 independent GHA training 
processes. It can be observed from Figure 22 that only a slight decrease in CSR is observed for the 
fixed-point format. In fact, the average CSR degradation is only 3.44 percentage points (from an average CSR of 
95.53% for the floating-point format to 92.09% for the fixed-point format). 
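
To give a feel for what a signed 8-bit fixed-point representation implies, the sketch below quantizes a real value to 8 bits with saturation. The text specifies only that the format is signed 8 bits; the number of fractional bits (`frac_bits = 6`), the use of floor truncation, and the function names are assumptions made for illustration and may differ from the actual design.

```python
import math


def to_fixed8(value: float, frac_bits: int = 6) -> int:
    """Quantize to a signed 8-bit code with frac_bits fractional bits (assumed)."""
    scaled = math.floor(value * (1 << frac_bits))   # truncate toward -infinity
    return max(-128, min(127, scaled))              # saturate to the 8-bit range


def from_fixed8(code: int, frac_bits: int = 6) -> float:
    """Convert a signed 8-bit code back to a real value."""
    return code / (1 << frac_bits)
```

With these assumed settings, 0.5 and 0.51 map to the same code, which is the kind of precision loss behind the CSR degradation reported above.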

Figure 22. The CSR distribution of GHA with fixed and floating point format. [Plot titled "CSR distributions"; horizontal axis: Classification Success Rate (%).] 

Another advantage of the proposed architecture is its superior computational capacity for GHA 
training. Figure 23 shows the CPU time of the NIOS-based SOPC system using the proposed architecture 
as a hardware accelerator for various numbers of training iterations with m = 16 × 16 and p = 7. The 
NIOS CPU clock rate in the system is 50 MHz. The target FPGA for the implementation is Cyclone III 
EP3C120F780C8. The CPU times of the software counterparts running on the general-purpose 1.6 GHz 
Intel i5 and 2.8 GHz Intel i7 processors are also depicted in Figure 23 for comparison. The 
software implementations are multithreaded to take advantage of all the cores in the processors. There 
are 16 threads in the code: 8 threads for synaptic weight updating, and 8 threads for the principal 
component computation and other tasks. An optimizing compiler (offered by Visual Studio) is used to 
further enhance the computational speed. It can be clearly observed from Figure 23 that the proposed 
architecture attains a high speedup over its software counterparts. In particular, when the number of 
training iterations reaches 1000, the CPU time of the proposed SOPC system is 733.14 ms. By contrast, 
the CPU time of the Intel i7 is 10,125.37 ms. The speedup of the proposed architecture over the software 
counterpart is therefore 13.81. 
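
For reference, the quoted speedup follows directly from the two CPU times at 1000 training iterations. The symbols $T_{\text{i7}}$ and $T_{\text{SOPC}}$ are introduced here only for this worked calculation.

```latex
\[
\text{speedup} = \frac{T_{\text{i7}}}{T_{\text{SOPC}}}
             = \frac{10{,}125.37\ \text{ms}}{733.14\ \text{ms}}
             \approx 13.81
\]
```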

Figure 23. The CPU time of the NIOS-based SOPC system using the proposed architecture 
as the hardware accelerator for various numbers of training iterations with m = 16 × 16 and 
p = 7. 
The proposed architecture has superior speed performance over its software counterparts because 
the software implementations are limited in how far they can exploit thread-level parallelism. The GHA 
is an incremental training algorithm. Therefore, it is difficult to exploit parallelism among the 
computations for different training vectors. The inherent data dependency among different GHA stages 
(e.g., between principal component computation and weight vector updating) may slow down the 
computation due to costly data forwarding via shared memory. Moreover, the inputs (i.e., x(n) and 
w_j(n), j = 1, ..., p) and outputs (i.e., y(n) and w_j(n + 1), j = 1, ..., p) of the algorithm are all 
vectors with large dimension. The large number of memory accesses required by GHA is another limiting 
factor for performance enhancement of software implementations. By contrast, the proposed architecture 
is able to perform data forwarding and memory accesses in an efficient manner. The employment of 
Buffers A, B and C allows training vector fetching and weight vector computation to operate in parallel. 
The latency of memory access can then be hidden. Moreover, Buffers B and C are also designed for fast 
data forwarding between principal component computation and weight vector updating without complicated 
memory management and external memory accesses. 
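
The stage-to-stage data dependency described above can be illustrated with a minimal pure-Python sketch of one GHA iteration. It assumes the standard GHA (Sanger) update, w_j(n + 1) = w_j(n) + eta y_j(n) [x(n) - sum over k <= j of y_k(n) w_k(n)]. The function name gha_step, the learning rate eta and the list-based data layout are illustrative assumptions; the sketch models the algorithm only, not the buffers or circuits of the proposed architecture.

```python
def gha_step(W, x, eta):
    """One GHA iteration for a single training vector x (illustrative sketch).

    W   : list of p weight vectors w_j(n), each a list of m floats
    x   : training vector x(n), a list of m floats
    eta : learning rate (assumed constant here)
    Returns (y, W_next) with y = y(n) and W_next = [w_j(n + 1)].
    """
    # Stage 1: principal component computation, y_j(n) = w_j(n) . x(n).
    y = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]

    # Stage 2: weight vector updating. Every update needs y from stage 1,
    # which is the dependency that forces data forwarding between stages.
    W_next = []
    residual = list(x)  # becomes x(n) minus the sum of y_k(n) w_k(n), k <= j
    for w, yj in zip(W, y):
        residual = [r - yj * wi for r, wi in zip(residual, w)]
        W_next.append([wi + eta * yj * r for wi, r in zip(w, residual)])
    return y, W_next


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    # Training vectors must be processed one after another, because each
    # step consumes the weights produced by the previous step.
    for x in ([1.0, 2.0], [2.0, 1.0]):
        y, W = gha_step(W, x, eta=0.1)
    print(W)
```

Because each call consumes the weights returned by the previous call, the loop over training vectors cannot be split across threads without changing the algorithm, which is the limitation on thread-level parallelism noted in the text.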

In addition to having superior computational speed, the proposed architecture consumes less power. 
Table 4 shows the power consumption of various GHA implementations. For the power estimation 
of the GHA software implementations, the tool Joulemeter (developed by Microsoft Research) [24] is 
used. The tool is able to estimate the power consumed by the CPU for a specific application. The power 
consumption of other parts of a computer, such as the main memory and monitor, can therefore be excluded 
from the comparisons. The power consumed by the proposed architecture is estimated by the PowerPlay Power 
Analyzer Tool [25] provided by Altera. From Table 4, it can be observed that the power consumption 
of the proposed architecture is only 0.4% of that of the Intel i7 processor for GHA training (i.e., 0.129 W 
versus 31.656 W). As compared with the low power multicore processor Intel i5 for laptop computers, 
the proposed architecture also has significantly lower power dissipation (i.e., 0.129 W versus 1.292 W). 
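
The percentages quoted above can be checked with a short Python sketch. The dictionary keys and the helper name percent_of are illustrative assumptions; the power values are those reported in Table 4.

```python
# Estimated power in watts, as reported in Table 4.
POWER_W = {
    "proposed": 0.129,   # proposed architecture on the Cyclone III FPGA
    "intel_i7": 31.656,  # multithreaded software, 2.8 GHz Intel i7
    "intel_i5": 1.292,   # multithreaded software, 1.6 GHz Intel i5
}


def percent_of(part_w, whole_w):
    """Express part_w as a percentage of whole_w (hypothetical helper)."""
    return 100.0 * part_w / whole_w


if __name__ == "__main__":
    for cpu in ("intel_i7", "intel_i5"):
        ratio = percent_of(POWER_W["proposed"], POWER_W[cpu])
        print(f"proposed / {cpu}: {ratio:.1f}%")
```

The first ratio rounds to the 0.4% stated in the text; the second shows that the proposed architecture draws roughly one tenth of the power of the Intel i5.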



Table 4. Power Consumption of Various GHA Implementations. 

GHA Implementations    Proposed Architecture               Multithreaded Software (16 threads)    Multithreaded Software (16 threads)
Multicore Processor    -                                   Intel i7                               Intel i5
FPGA Device            Altera Cyclone III EP3C120F780C8    -                                      -
Clock Rate             50 MHz                              2.8 GHz                                1.6 GHz
Estimated Power        0.129 W                             31.656 W                               1.292 W



Table 5 compares the computation speed of various GHA architectures implemented on FPGAs. 
As in Figure 23, the computation time of the proposed architecture is measured as the CPU time 
of the NIOS processor using the proposed architecture as the hardware accelerator. The clock rate of 
the NIOS CPU in the system is 100 MHz. The vector dimension and the number of principal components 
associated with the proposed architecture are m = 16 x 16 and p = 16, respectively. The computation 
times of the architectures in [18,19], which use different m and/or p values, are also included in the table. 

Table 5. Computation Time of Various GHA Architectures. 

Architectures                           Proposed Architecture               [18]                                [19]
Vector Dimension m                      16 x 16                             4 x 4                               16 x 8
# of Principal Components p             16                                  4                                   16
FPGA Device                             Altera Cyclone III EP3C120F780C8    Altera Cyclone III EP3C120F780C8    Xilinx Virtex 4 XC4VFX12
Clock Rate                              100 MHz                             75 MHz                              136.243 MHz
Iteration Numbers                       100                                 100                                 100
# of Training Vectors per Iteration     888 x 8                             888 x 8                             888 x 8
Computation Time                        1.369 s                             86.58 ms                            2.09 s

Note that direct comparisons of these architectures may be difficult because their speed 
is measured on different FPGA devices with different m, p and/or clock rates. To show the 
superiority of the proposed architecture, the comparisons are based on the same training size (i.e., number 
of training vectors per iteration) and number of iterations. With a larger vector dimension (i.e., 16 x 16 
versus 16 x 8), a slower clock rate (i.e., 100 MHz versus 136.243 MHz), and the same number of principal 
components (i.e., p = 16), it can be observed from Table 5 that the proposed architecture still has a shorter 
computation time than the architecture in [19]. Although the architecture in [18] has 
the shortest computation time, it is suitable only for small vector dimensions (i.e., m = 4 x 4) 
and a small number of principal components (i.e., p = 4). All these facts demonstrate the effectiveness of 
the proposed architecture. 
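
One way to partly factor out the clock-rate differences is to convert each computation time in Table 5 into an approximate number of clock cycles (time multiplied by clock rate). The Python sketch below does this under stated assumptions: it ignores the differences in m, p and FPGA device, the labels and the helper name cycles are illustrative, and the result is a rough normalization rather than a measurement reported in this paper.

```python
# (computation time in seconds, clock rate in Hz), taken from Table 5.
TABLE5 = {
    "proposed": (1.369, 100e6),
    "ref18": (86.58e-3, 75e6),
    "ref19": (2.09, 136.243e6),
}


def cycles(time_s, clock_hz):
    """Approximate clock cycles spent, assuming a constant clock rate."""
    return time_s * clock_hz


if __name__ == "__main__":
    for name, (time_s, clock_hz) in TABLE5.items():
        print(f"{name}: {cycles(time_s, clock_hz) / 1e6:.1f} M cycles")
    # Wall-clock ratio between [19] and the proposed architecture.
    print(f"time ratio [19]/proposed = {TABLE5['ref19'][0] / TABLE5['proposed'][0]:.2f}")
```

Under these assumptions the proposed architecture needs fewer cycles than the architecture in [19] despite its larger vector dimension, which is consistent with the comparison made in the text.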

5. Concluding Remarks 

Experimental results reveal that the proposed GHA architecture has superior speed performance 
over its software counterparts and other GHA architectures. With a lower clock rate and a higher vector 
dimension, the proposed architecture still has a shorter computation time than the architecture in [19]. 
In addition, the architecture is able to attain a higher CSR for texture classification as compared with 
other GHA architectures. In fact, the CSRs are above 90% for all the experiments considered in 
this paper. The proposed architecture also has low area costs for fast PCA with high vector 
dimensions up to m = 32 x 32. The utilization of memory bits and embedded multipliers in the FPGA 
implementation is independent of the vector dimension and the number of principal components. The 
proposed architecture is therefore an effective alternative for on-chip learning applications requiring low 
area costs, a high classification success rate and high speed computation. 